(57) Methods and apparatus are provided for efficiently normalizing and denormalizing data for cryptography
processing. The normalization and denormalization techniques can be applied in the context of a cryptography
accelerator coupled with a processor. Hardware normalization techniques are applied to data prior
to cryptography processing. Context circuitry tracks the shift amount used for normalization. After cryptography
processing, the processed data is denormalized using the shift amount tracked by the context circuitry.
Description

CROSS REFERENCE TO RELATED APPLICATIONS 

[0001] This application claims priority under U.S.C. 119(e) from U.S. Provisional Application No. 60/235,190, entitled
"E-Commerce Security Processor," as of filing on September 20, 2000, the disclosure of which is herein incorporated
by reference for all purposes.

BACKGROUND OF THE INVENTION 

1. Field of the Invention.

[0002] The present invention relates to normalization and denormalization of data. More specifically, the present 
invention relates to normalizing data for cryptography processing and denormalizing the processed output. 

2. Description of the Prior Art 

[0003] Various hardware implementations for cryptography processing typically use software-configured external
processors to both normalize and denormalize data associated with cryptographic processing. Many methods for
performing cryptography processing are well known in the art and are discussed, for example, in Applied Cryptography,
Bruce Schneier, John Wiley & Sons, Inc. (1996, 2nd Edition), incorporated by reference in its entirety for all purposes.
In order to improve the speed of cryptography processing, specialized cryptography accelerators have been developed
that typically out-perform similar software implementations. Examples of such cryptography accelerators include the
Hi/fn™ 6500 and BCM™ 5805 manufactured by Broadcom, Inc. of San Jose, CA.

[0004] Cryptography accelerators, such as the BCM™ 5805 and Hi/fn™ 6500 chips, typically use software-configured
external processors to provide normalized data or normalized numbers for cryptography processing. Generally, a
floating point number having no leading zeros is referred to herein as a normalized number. For example, 1.0 x 10^9 is in
normalized floating point notation while 0.1 x 10^8 is not. In binary notation, the binary number "10100010" is a
normalized binary number while the binary number "01010001" is an unnormalized number. Typically, an unnormalized
number is converted to a corresponding normalized number by, in the example of the binary numbers, performing a
shift operation. Using the example from above, the unnormalized binary number "01010001" is shifted left by one bit
to provide the normalized binary number "10100010", which is now in condition to undergo cryptography processing.
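By way of illustration only, the shift operation described above can be sketched in Python. The function name normalize, the 8 bit default width, and the handling of an all-zero word are assumptions made for this sketch and are not taken from the described accelerators.

```python
from typing import Tuple


def normalize(word: int, width: int = 8) -> Tuple[int, int]:
    """Left shift `word` until its leading one is the most significant bit.

    Returns (normalized_word, shift_amount). `word` is assumed to fit in
    `width` bits; treating an all-zero word as already normalized is an
    assumption of this sketch.
    """
    if word == 0:
        return 0, 0
    shift = width - word.bit_length()  # number of zeros above the leading one
    return (word << shift) & ((1 << width) - 1), shift


normalized, shift = normalize(0b01010001)
print(format(normalized, "08b"), shift)    # 10100010 1
print(format(normalized >> shift, "08b"))  # 01010001
```

Shifting right by the same amount restores the original word, which is why only the shift amount needs to be remembered.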
[0005] Generally, modifying the result of the cryptography processing by the previous shift amount provides a
corresponding denormalized number. Again, using the examples from above, if the unnormalized binary number
"01010001" is shifted left one bit to form the normalized binary number "10100010" and cryptography processing on
the normalized binary number "10100010" yields a result data word "11001100", then denormalizing the result data word
"11001100" using the normalizing shift amount, here a right shift by one bit, results in a "denormalized" result data
word "01100110".
[0006] Unfortunately, however, conventional external processors (such as central processing units, or CPUs) are
not optimized to handle the myriad of normalization and denormalization operations required for cryptography
processing. For example, both the BCM™ 5805 and Hi/fn™ 6500 are typically configured to process data blocks that are much
larger than those data blocks that a central processing unit is optimized to handle.

[0007] Most encryption schemes (such as Diffie-Hellman, RSA, and DSA) commonly have data block sizes on the
order of 512 to 1024 bits or sometimes larger. Typical central processing units, however, can only handle blocks of
data of 32 or 64 bits at a time. As one skilled in the art would appreciate, in order to accommodate these large data
blocks, the CPU consumes large amounts of valuable processing resources. Since software configuration requires copying
large amounts of data to intermediate storage during normalization and denormalization, the 512 or 1024 bit data blocks
would be read and copied 32 bits at a time to intermediate storage and subsequently reread and recopied onto an output.
[0008] The processing of data blocks of 512 or 1024 bits using software-configured 32 bit or 64 bit architectures
substantially reduces cryptography processing throughput and increases software complexity. Furthermore, software
configurations are typically slow, cumbersome, and nontrivial.

[0009] It is therefore desirable to have a system, method, and apparatus that provides for efficient hardware normal- 
ization and denormalization suitable for high speed cryptography processing. 

Summary of the Invention 

[0010] Methods and apparatus are provided for efficiently normalizing and denormalizing data for cryptography
processing. The normalization and denormalization techniques can be applied in the context of a cryptography
accelerator coupled with a processor. Hardware normalization techniques are applied to data prior to cryptography
processing.

[0025] Broadly speaking, the invention relates to a system, method, and apparatus for efficiently normalizing data
provided to a cryptography accelerator as well as denormalizing the corresponding processed data. In one
embodiment, a cryptography accelerator coupled to a processor includes normalization circuitry for writing unnormalized
data into a buffer in normalized form by shifting the data by a shift amount. Data path circuitry performs cryptography
processing operations on the normalized data in the buffer. Denormalization circuitry coupled with the data path circuitry
denormalizes the processed data using the shift amount.

[0026] The invention will now be described in terms of a cryptographic accelerator system that can be implemented
in a number of ways, such as, for example, as a stand alone integrated circuit, as embedded software, or as a subsystem
included in, for example, a server computer used in a variety of Internet and Internet related activities. It should be
noted, however, that the invention is not limited to the described embodiments and can be used in any system where
high speed encryption is desired.

[0027] Figure 1 is a diagrammatic representation of one example of a cryptographic processing system 100 in
accordance with an embodiment of the invention. As shown in Figure 1, the present invention may be implemented in a
stand-alone cryptography accelerator 102 or as part of the system 100. In the described embodiment, the cryptography
accelerator 102 is connected to a bus 104 such as a PCI bus via a standard on-chip PCI interface. The processing
system 100 includes a processing unit 106 and a system memory unit 108. The processing unit 106 and the system
memory unit 108 are coupled to the system bus 104 via a bridge and memory controller 110. Although the processing
unit 106 may be the central processing unit or CPU of a system 100, it does not necessarily have to be the CPU. It
can be one of a variety of processors. A LAN interface 114 couples the processing system 100 to a local area network
(LAN) and receives packets for processing and writes out processed packets to the LAN (not shown). Likewise, a Wide
Area Network (WAN) interface 112 connects the processing system to a WAN (not shown) such as the Internet, and
manages in-bound and out-bound packets, providing automatic security processing for IP packets.
[0028] A cryptography accelerator 102 can perform many cryptography processing computations using what is
referred to as long integer arithmetic. Long integer arithmetic performs operations on numbers that can be hundreds of
digits long. For example, public key computations such as Diffie-Hellman, RSA, and DSA have primitive operations that
use long integer arithmetic on 1024-bit numbers. Hardware implementations use what is referred to as carry save
representation to perform long integer arithmetic. Carry save format represents a number using two independent
quantities or values called sum bits and carry bits. At the end of the operation, the sum bits and carry bits are added
together using regular adders to convert the number back to binary form. In this way, carry save computation defers
resource intensive carry propagation until the final step in an operation, such as the end of a sum of numbers.
Carry save computation and other topics relevant to the present invention are discussed in
Computer Organization and Design, John Hennessy and David Patterson, Morgan Kaufmann Publishers (1998, 2nd
Edition), the entirety of which is herein incorporated by reference for all purposes. In addition to using carry save adders,
the present invention may use a variety of ripple adders, carry lookahead adders, and MSI adders.
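As a non-limiting illustration of carry save representation, the following Python sketch keeps a running total as separate sum bits and carry bits and performs a single carry propagating addition at the end. The function names and the use of Python integers in place of hardware registers are assumptions of this sketch.

```python
def carry_save_add(a: int, b: int, c: int):
    """3:2 compression: reduce three operands to (sum_bits, carry_bits).

    Each bit position is handled independently, so no carry ripples
    across the word. The invariant is sum_bits + carry_bits == a + b + c.
    """
    sum_bits = a ^ b ^ c
    carry_bits = ((a & b) | (a & c) | (b & c)) << 1
    return sum_bits, carry_bits


def carry_save_sum(operands):
    """Accumulate operands in carry save form, then resolve carries once."""
    sum_bits, carry_bits = 0, 0
    for value in operands:
        sum_bits, carry_bits = carry_save_add(sum_bits, carry_bits, value)
    # The only carry propagating addition, performed at the final step.
    return sum_bits + carry_bits


values = [13, 7, 250, (1 << 1023) + 5]
print(carry_save_sum(values) == sum(values))  # True
```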

[0029] Carry save representation, however, can require that data be normalized before computation and denormalized
after computation. Still referring to Figure 1, the processing unit 106 normalizes data prior to sending the data
packet to the bus 104 by way of the bridge 110 for cryptography accelerator 102. As one of skill in the art will appreciate,
many cryptography processing operations are based on y = g^x mod(n). Each of the values g, x, and n is typically
supplied in normalized form to prior art cryptography processors. Many variations of y = g^x mod(n) exist, such as
y = g^x mod(n) mod(m).
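For illustration, the operation y = g^x mod(n) can be computed with the well known square-and-multiply method, sketched below in Python. This is a generic textbook technique offered as an assumption about how such an operation may be decomposed; it is not asserted to be the method used by any particular accelerator.

```python
def mod_exp(g: int, x: int, n: int) -> int:
    """Compute g**x mod n by scanning the exponent from its least significant bit."""
    result = 1 % n
    base = g % n
    while x > 0:
        if x & 1:
            result = (result * base) % n  # multiply step for a set exponent bit
        base = (base * base) % n          # square step for the next bit
        x >>= 1
    return result


print(mod_exp(4, 13, 497))  # 445
```

Each multiply and square step is itself a long integer modular multiplication of the kind discussed in connection with the data path.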

[0030] Figure 2 shows a packet that can be used by the processing unit 106 to transmit g, x, and n along with other
data to the cryptography accelerator 102. Packet 201a can contain header 203a along with payload
comprising 205a, 207a, 209a, 211a, and 213a. In the packet shown in Fig. 2, the header 203a contains address and
length information, the block 205a contains the normalized form of g, the block 207a contains the normalized form of
x, and the block 209a contains the normalized form of n. In the example shown, each block size is a multiple of 32 bits
and n is 1024 bits in length. Other data can be provided as well in blocks 211a and 213a.

[0031] According to the present invention, the processing unit 106 does not normalize the data g, x, and n prior to
transmitting packet 201b to cryptography accelerator 102. Block 205b can contain g, block 207b can contain x, and
block 209b can contain n. Each block size again can be a multiple of 32 bits and n can be 1028 bits. The processor
106 can provide the positions of the leading ones in each of blocks 205b, 207b, and 209b so that the cryptography
accelerator 102 can more easily normalize the data. The leading one is the most significant one in a string of bits. For
example, in the string 0101, the leading one would be the second digit from the left. As will be appreciated by one of
skill in the art, other information can be provided by the processor 106 to cryptography accelerator 102. For example,
the length of each block can also be provided.
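As a hedged illustration of the leading one information a processor might supply, the Python sketch below reports, for a value held in a field of a given length, the number of bits from the leading one down to the least significant bit and the number of zeros above the leading one. The function name leading_one_info and its return layout are hypothetical.

```python
def leading_one_info(value: int, total_bits: int):
    """Return (significant_bits, leading_zeros) for `value` in a `total_bits` field.

    significant_bits counts from the leading one down to the least
    significant bit; leading_zeros counts the zeros above the leading one.
    """
    significant_bits = value.bit_length()
    leading_zeros = total_bits - significant_bits
    return significant_bits, leading_zeros


# The string 0101: one zero precedes the leading one, which is therefore
# the second digit from the left.
print(leading_one_info(0b0101, 4))  # (3, 1)
```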

[0032] Figure 3 is a diagrammatic representation of one embodiment of a cryptography accelerator 102 that can
receive the packet 201b containing data that has not yet been normalized. A cryptography accelerator 102 interfaces
with components described in Figure 1 through an interface such as a PCI interface 302.

[0033] According to various embodiments, a normalization and denormalization system 304 is coupled with the bus
interface 302 to receive data that has not yet been normalized. The normalization and denormalization system 304 is
used to receive data from packet 201b for public key processing. The components in packet 201b are normalized prior
to cryptography processing and denormalized after cryptography processing using the register files. The register files
and other components in the normalization and denormalization system 304 will be described further below
with reference to Figures 4-6.

[0034] In the described embodiment, the cryptography accelerator 102 can include a key setup execution unit such
as a DH (Diffie-Hellman)/RSA/DSA unit 306 and a random number generator unit 308 to facilitate the public key
processing. It is a well established fact that a hardware random number generator 308 is better able to produce numbers
in a more random fashion than is a software random number generator. The key setup execution unit 306 accelerates the
public key operations and the random number generator unit 308 generates secure private keys. A number of both
public-key and private-key operations can be performed in parallel. Although not shown in Figure 3, the cryptography
accelerator 102 can include buffers along with the various other components. The buffers can be used to handle the
long latency periods during public-key and private-key operations. Other components can be used for context and data
handling. In one embodiment, RSA private key operations are performed in parallel on the same chip.
[0035] The cryptography accelerator 102 can also use cell based processing as described in co-pending U.S.
Application No. 09/510,486, entitled "Security Chip Architecture And Implementations For Cryptography Acceleration" at
the time of filing on February 23, 2000, the entirety of which is hereby incorporated by reference for all purposes.
Context information needed to process the current packet is read in and stored in the pre-fetch context buffer 316. The
cryptography accelerator 102 can include cryptography engines 310 and 312 along with other engines. In one
embodiment, the cryptography engine 310 is a "3DES-CBC" unit 310 that provides encryption and decryption of incoming
packets and the cryptography engine 312 is a "MD5/SHA1" unit 312 that provides authentication and digital signature
processing. It should be noted that in addition to the cryptography units shown, any other current or future algorithms
may be supported in the cryptography accelerator 102. For in-bound packets received from an outside source such
as another computer or an external network, the cells can be first authenticated and then decrypted in parallel fashion.
For out-bound packets destined for an outside source, the cells can be first encrypted then authenticated, again in
pipelined fashion. The sequencing of the data processing and pre-fetching is controlled by a microcontroller 314, and
the program code ensures that the cryptography engines are continually provided with cells and context information.
[0036] The cryptography accelerator 102 can also contain additional components for normalization and
denormalization. For example, an arithmetic logic block can be coupled to the normalization and denormalization system for
cryptography processing. Alternatively, specific arithmetic logic units can be integrated into the normalization and
denormalization system 304.

[0037] Figure 4 describes one embodiment of a normalization and denormalization system 304 having integrated
arithmetic logic units in accordance with an embodiment of the invention. The normalization and denormalization system
304 includes normalization unit 401 for normalizing data. As noted above, normalizing data typically comprises shifting
bits so that a leading one becomes the most significant bit. For example, an unnormalized data word D1 "00101111"
after normalization becomes a normalized data word D1n "10111100", where context circuitry 403 tracks the shift
amount. In this example, the corresponding shift amount is two bits. Bits can be shifted using conventional barrel
shifters or bits can be shifted on the fly as data is written from the data packet to the buffer 413. In one example,
the buffer contains register files 407. The register files 407 can comprise four 1028 bit blocks.

[0038] According to various embodiments, the shift amount is provided in data packet 207b. In one example, data
can be written to register files 407 in normalized form. The shift amount is tracked using context circuitry 403. The
normalized data is processed by cryptography processing unit 409. According to various embodiments, multiple
cryptography processing units can be used with a single normalization unit 401 and a single denormalization unit 405.
Cryptography processing unit 409 can use carry save computation. As noted above, carry save computation defers
carry propagation until the final step. After data is processed by cryptography processing unit 409, regular adders can
be used for carry propagate computation at 411. The resulting data can be written to register files 407. The
denormalization unit 405 uses the shift amount stored in context circuitry 403 and denormalizes the data in the register files 407.
[0039] Figure 5 is a diagrammatic representation of normalizing data from a data source such as a data packet to register
files 407 in accordance with an embodiment of the invention. Data 509 may be provided to the cryptography accelerator
by the central processing unit 106 or some other processor in a packet such as the packet shown in Figure 2. According
to various embodiments, the length of data 509 is M, which is typically 1024 bits. The length of data beginning from the
leading one to the least significant bit is N. The blocks 511, 513, 515, 517, 521, 523, 525 and 527 can be 32 bits in
length. It should be noted that block 523 can represent multiple blocks. Blocks 533, 535, 537, 539, 541, 543, 555, and
557 in register file 531 are also 32 bits in length. Similarly, block 537 can represent multiple blocks. The data 509 and
the register file 531 can both comprise 32 blocks. Blocks 527 and 557 containing the least significant bits of data 509
and register file 531 are herein referred to as the least significant blocks or block 0. Similarly, blocks 511 and 533
containing the most significant bits of data 509 and register file 531 are herein referred to as the most significant blocks
or block 31.

[0040] Both M 501 and N 503 can be provided in the data packet 207b received by the cryptography accelerator.
According to one embodiment, blocks 511, 513, and 515 all contain zeros while 517 contains 16 zero bits. In other
words, blocks 29 through 31 contain only zeros and the 16 most significant bits of block 28 are zeros. The bits following
the leading one in block 28 through block 0 in data 509 are written to block 31 through block 3 in register file 531.
The zeros contained in block 31 through block 28 in data 509 are written to block 3 through block 0 in register file 531.

[0041] According to various embodiments, data 509 is written to register file 531 "on the fly." As a block of bits is
read from data 509, a block of bits is written to register file 531. The following pseudo code implemented in hardware
can perform normalization "on-the-fly" by reading and writing blocks of bits:



    r = m / 32;                        /* M divided by the block size */
    s = n / 32;                        /* N divided by the block size */
    shf = n % 32;                      /* shift amount within a block */
    if (shf != 0) {
        din_d = 0;                     /* previously read block, initially zero */
        for (i = 0; i < 32; i++) {
            addr = (r - s - 1 + i) % 32;
            din = next_word();
            data = din << 32 | din_d;  /* 64 bit intermediate value */
            data = (data >> shf) & 0xffffffff;
            write_register(addr, data);
            din_d = din;
        }
    }
    else {
        for (i = 0; i < 32; i++) {
            addr = (r - s + i) % 32;
            data = next_word();
            write_register(addr, data);
        }
    }
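The pseudo code above can be modeled in software for purposes of checking its behavior. The Python sketch below is such a model under stated assumptions: blocks are supplied least significant first, the register file is a list of 32 integers, and the helper names to_words and from_words are hypothetical conveniences that do not correspond to any element of the figures.

```python
BLOCK_BITS = 32
NUM_BLOCKS = 32
MASK = (1 << BLOCK_BITS) - 1


def to_words(value: int):
    """Split a 1024 bit integer into 32 blocks, least significant block first."""
    return [(value >> (BLOCK_BITS * i)) & MASK for i in range(NUM_BLOCKS)]


def from_words(words) -> int:
    """Reassemble an integer from blocks, least significant block first."""
    return sum(w << (BLOCK_BITS * i) for i, w in enumerate(words))


def normalize_on_the_fly(words, n_bits: int):
    """Write each block to the register file as it is read, per the pseudo code."""
    r = (NUM_BLOCKS * BLOCK_BITS) // BLOCK_BITS  # r = m / 32
    s = n_bits // BLOCK_BITS                     # s = n / 32
    shf = n_bits % BLOCK_BITS                    # shf = n % 32
    register_file = [0] * NUM_BLOCKS
    din_d = 0                                    # previously read block
    for i, din in enumerate(words):
        if shf != 0:
            addr = (r - s - 1 + i) % NUM_BLOCKS
            data = ((din << BLOCK_BITS | din_d) >> shf) & MASK
        else:
            addr = (r - s + i) % NUM_BLOCKS
            data = din
        register_file[addr] = data
        din_d = din
    return register_file


# N = 912 corresponds to 28 full blocks plus 16 bits, i.e. a shift amount of 16.
value = (1 << 911) | 0x1234567
result = from_words(normalize_on_the_fly(to_words(value), 912))
print(result == value << (1024 - 912))  # True
```

The wrap-around of the address is harmless in this model because the blocks above the leading one contain only zeros, so the words written to the low addresses are zero.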

[0042] Figure 6 is a flow diagram implemented in hardware describing aspects of the pseudo code for normalizing
data, according to various embodiments. Figure 6 will be described with reference to Figure 5 and the pseudo code.
The normalization process 600 begins by identifying R, S, and the shift amount. R is equal to the length of the data M
501 divided by the number of bits per data block. In other words, R is equal to M divided by 32. S is equal to the number
of bits N 503 divided by the number of bits per data block. In other words, S is equal to N divided by 32. The shift
amount 505 is the modulus of N and 32. According to various embodiments, R, S, and the shift amount may be provided
to the cryptography accelerator by another processor, such as a central processing unit.

[0043] At 603, if the shift amount is zero, blocks of bits from data 509 can be written as blocks of data to register
file 531 without shifting bits within each block. If the shift amount is zero, a counter I is set to 0 at 605. If I is
less than 32, representing the number of blocks in data 509, at 607, block I is read from data 509 at 609. Block I is
then written to register file block ((R-S+I)%32) at 611. I is then incremented by 1 at 613 and the process continues at
607. For example, when R and S are 32 and 29 respectively, data 509 has 29 blocks of data following the leading one and
three blocks of data preceding the leading one. When I is 0, block 0 of data 509 is written to block 3 of register file
531, since ((32-29+0)%32) is equal to 3. When I is incremented by 1 at 613, block 1 from data 509 is read and written to
block 4 of register file 531, since ((32-29+1)%32) is equal to 4. The process continues until block 31 is read from data
509 and written to block 2, since ((32-29+31)%32) is equal to 2. The blocks of register file 531 are written starting at
block 3 through block 31 and subsequently from block 0 through block 2, according to specific embodiments.
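The write order described for the zero shift case can be reproduced with a few lines of Python. The function name zero_shift_address is hypothetical and the sketch merely evaluates the address expression ((R-S+I)%32) for each I.

```python
def zero_shift_address(i: int, r: int, s: int, num_blocks: int = 32) -> int:
    """Register file block that receives data block i when the shift amount is zero."""
    return (r - s + i) % num_blocks


order = [zero_shift_address(i, 32, 29) for i in range(32)]
print(order[0], order[1], order[31])  # 3 4 2
```

The resulting sequence runs from block 3 through block 31 and then wraps to blocks 0 through 2.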

[0044] The normalization of data proceeds similarly even when the shift amount is not zero at 603. At 615, a value
Din_d and I are both set to 0. While I is less than 32 at 617, block I is read from data 509. The data in block I is left
shifted 32 bits and a bitwise OR is performed with the contents of Din_d at 621. The result of 621 is then right shifted
by the shift amount and a bitwise AND is performed with 0xffffffff at 623.

[0045] For example, in a system with 8 bit block sizes, a shift amount of 4, I=0, and block 0 containing 1001 0011,
Din_d would initially contain 0000 0000. Shifting block 0 a total of 8 bits to the left and performing a bitwise OR would
yield 1001 0011 0000 0000. Right shifting the result 1001 0011 0000 0000 by the shift amount of 4 bits would yield
1001 0011 0000. Performing a bitwise AND with the number 0xff or 1111 1111 would yield 0011 0000.
[0046] The result at 623 is then written to register file block 2, since ((32-29-1+0)%32) is 2. Din_d gets the value
of block 0 at 627 and I is incremented by 1 at 629. Returning to the above noted 8 bit example, I is now 1. Block 1 is
read from data 509 and is found to contain 1010 0101. Shifting block 1 a total of 8 bits to the left would yield 1010 0101
0000 0000. The value of Din_d was the value of block 0, 1001 0011. Performing a bitwise OR on shifted block 1 and
Din_d would yield 1010 0101 1001 0011. Right shifting by the shift amount of 4 would yield 1010 0101 1001 and
performing a bitwise AND operation with 0xff or 1111 1111 would yield 0101 1001. The result at 623 is then written to
register file block 3, since ((32-29-1+1)%32) is 3. The process continues until I is equal to 32 and all blocks of data
509 have been read and written to register file 531.
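The 8 bit example above can be checked with the following Python sketch. The function name shift_step and its parameters are assumptions of the sketch; the arithmetic follows the shift, OR, shift, AND sequence described for 621 and 623.

```python
def shift_step(din: int, din_d: int, shift: int, block_bits: int = 8) -> int:
    """Combine the current block with the previous block and extract one output block."""
    mask = (1 << block_bits) - 1
    combined = din << block_bits | din_d  # current block above the previous block
    return (combined >> shift) & mask     # right shift, then keep one block


first = shift_step(0b10010011, 0b00000000, 4)
second = shift_step(0b10100101, 0b10010011, 4)
print(format(first, "08b"), format(second, "08b"))  # 00110000 01011001
```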

[0047] The above noted pseudo code and Figure 6 describe normalization for 32 bit blocks. However, the techniques
of the present invention can easily be adapted to handle various embodiments including systems using different size
blocks. As will be appreciated by one of skill in the art, a variety of implementations can also be used to perform the
techniques of the present invention. For example, the condition where the shift amount is equal to 0 does not need to
be checked, since the condition can be handled using the same bit shifting technique described for shift amounts not
equal to zero.

[0048] As will be appreciated by one of skill in the art, a process for hardware denormalization is similar to the
techniques described for hardware normalization. The normalization techniques of the present invention described
with reference to Figure 6, Figure 5, and the pseudo code can be adapted for use as denormalization techniques.
Context circuitry can track the shift amount for a denormalization process to convert data in register files back into
denormalized form.
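No pseudo code for denormalization is given above. Purely as an assumption about how the adaptation might look, the Python sketch below undoes a normalizing left shift of M - N bits one 32 bit block at a time. The names denormalize_blocks, to_words, and from_words are hypothetical, and the block ordering (least significant block first) is likewise an assumption.

```python
BLOCK_BITS = 32
NUM_BLOCKS = 32
MASK = (1 << BLOCK_BITS) - 1


def to_words(value: int):
    """Split a 1024 bit integer into 32 blocks, least significant block first."""
    return [(value >> (BLOCK_BITS * i)) & MASK for i in range(NUM_BLOCKS)]


def from_words(words) -> int:
    """Reassemble an integer from blocks, least significant block first."""
    return sum(w << (BLOCK_BITS * i) for i, w in enumerate(words))


def denormalize_blocks(register_file, n_bits: int):
    """Right shift the register file contents by M - N bits, block by block."""
    shift = NUM_BLOCKS * BLOCK_BITS - n_bits      # amount tracked by the context
    q, b = divmod(shift, BLOCK_BITS)              # whole blocks and residual bits
    padded = list(register_file) + [0] * (q + 1)  # blocks above the top read as zero
    out = []
    for j in range(NUM_BLOCKS):
        pair = padded[j + q + 1] << BLOCK_BITS | padded[j + q]
        out.append((pair >> b) & MASK)
    return out


value = (1 << 911) | 0x1234567
normalized = to_words(value << 112)               # N = 912, so M - N = 112
print(from_words(denormalize_blocks(normalized, 912)) == value)  # True
```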

[0049] As noted above, data is normalized prior to processing in a cryptography processing unit 409 shown in Figure
4 and denormalized after processing. Figure 7 shows one example of a cryptography processing unit. As will be
appreciated by one of skill in the art, a fundamental cryptography computation step is P = A * B mod N. According to
various embodiments, it can be difficult to multiply two 1024 bit numbers, perform carry propagation, and then take a
modulus. Instead, the fundamental cryptography computation step can be separated into iterations of the following:

    P' = 4*P + A*Booth(B);

and

    P'' = P' - estimate(k) * N.

[0050] According to various embodiments, the data path of Figure 7 shows one example of a system for performing
computation of P. Booth encoding block 701 multiplies A by Booth encoded B. The number of partial products needed
for performing multiplication is reduced by half when Booth encoding block 701 is used. The time required for
multiplication using Booth encoding is substantially less than the time required for typical multiplication schemes. Booth
encoding is described in Computer Organization and Design, John Hennessy and David Patterson, Morgan Kaufmann
Publishers (1998, 2nd Edition), which is incorporated by reference for all purposes in its entirety.
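To illustrate why Booth encoding halves the number of partial products, the Python sketch below recodes a multiplier into radix-4 Booth digits and accumulates the product with the recurrence P' = 4*P + A*digit. It is a textbook radix-4 recoding offered as an assumption; the modular reduction step using estimate(k) is deliberately omitted, and the function names are hypothetical.

```python
def booth_radix4_digits(b: int, bits: int):
    """Recode a non-negative `bits` wide multiplier into digits in {-2, -1, 0, 1, 2}.

    Digits are returned least significant first; each digit covers two
    multiplier bits, so about bits / 2 partial products are needed.
    """
    digits = []
    prev = 0                              # implicit bit to the right of bit 0
    for i in range(0, bits + 1, 2):
        b0 = (b >> i) & 1
        b1 = (b >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(a: int, b: int, bits: int) -> int:
    """Multiply using one shift-by-two and one add per Booth digit."""
    product = 0
    for digit in reversed(booth_radix4_digits(b, bits)):
        product = 4 * product + a * digit  # mirrors P' = 4*P + A*Booth(B)
    return product


print(booth_radix4_digits(0x6D, 8))              # [1, -1, -1, 2, 0]
print(booth_multiply(0xB7, 0x6D, 8) == 0xB7 * 0x6D)  # True
```

Because every digit is 0, plus or minus 1, or plus or minus 2, each partial product is obtainable from A by at most a one bit shift and a negation.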
[0051] Block 703 represents a bit shifter that allows multiplication and division by factors of 2. Block 703 can perform
4 * P as well as division by 2. As will be appreciated by one skilled in the art, left shifting the bits in a binary number
by one bit is equivalent to a multiplication by two. Left shifting by two bits is equivalent to multiplication by four. Similarly,
right shifting by one bit is equivalent to division by two. The 4 * P represented by block 703 and the A * Booth(B)
represented by block 701 are summed using adder 705. According to various embodiments, estimator 707 looks at
the 11 most significant bits of a data block to form an estimation of an adjustment factor. The estimator is described in
Hardware Implementation, Cetin Kaya Koc, TR 801, RSA Laboratories, 30 pages, April 1996, the entirety of which
is incorporated by reference for all purposes. The estimate(k) and N are provided along with P' to carry save adders
709 and 711. As noted above, carry save adders can defer carry propagation until the final step. Carry save adders
use carry bits and sum bits stored in carry bit and sum bit registers 713, also referred to as carry save accumulators.
The result can then be passed back to block 703 for multiplication by shifting.

[0052] As noted above, the data path described in Figure 7 can precede a carry propagation block for adding the
carry bits and the sum bits using conventional adders. The result from the carry propagation block can then be
denormalized by a denormalization unit using context circuitry as described in Figure 4. It should also be noted that many
elements shown in Figure 7 are optional, or can be replaced with comparable components. For example, the Booth
encoding block can be replaced by shifters and adders.

[0053] While the invention has been particularly shown and described with reference to specific embodiments thereof, 
it will be understood by those skilled in the art that changes in the form and details of the disclosed embodiments may 
be made without departing from the spirit or scope of the invention. For example, the embodiments described above 
may be implemented using firmware, software, or hardware. Moreover, embodiments of the present invention may be 
employed with a variety of communication protocols and should not be restricted to the ones mentioned above. There- 
fore, the scope of the invention should be determined with reference to the appended claims. 



Claims 

1. A normalization/denormalization circuit included in a cryptography accelerator unit coupled to an external
processor, comprising:

a normalization sub-circuit arranged to generate normalized data based upon corresponding unnormalized
data;

a context sub-circuit coupled to the normalization sub-circuit for characterizing the normalized data in relation
to the unnormalized data; and

a denormalization sub-circuit coupled to the context sub-circuit arranged to provide the unnormalized data
based upon the normalized data and the characterization, wherein the normalization/denormalization circuit
efficiently provides a normalization/denormalization service to the cryptography accelerator unit such that
substantially no external processor resources are used to normalize or denormalize data.

2. A circuit according to claim 1, wherein the normalization sub-circuit generates the normalized data on the fly by,

receiving the unnormalized data from the processor, and
shifting the unnormalized data by a shift amount to form the normalized data.

3. A circuit according to claim 2, wherein the context sub-circuit characterization includes tracking the shift amount. 

4. A circuit according to any of claims 2-3, wherein the denormalization sub-circuit uses the shift amount to modify the
normalized data to form the unnormalized data.

5. A circuit according to any of the preceding claims, wherein the cryptography accelerator unit further includes,

a register unit coupled to the normalization sub-circuit and the denormalization sub-circuit arranged to store
the normalized data, and

a cryptography processing unit coupled to the register unit configured to perform cryptography processing on
normalized data received from the register unit.

6. A circuit according to any of the claims 2-4, wherein the shift amount is equivalent to the number of zeros more
significant than the leading one of the corresponding unnormalized data.

7. A circuit according to any of the preceding claims, wherein the cryptography processing unit includes a carry save 
adder unit. 

8. A circuit according to any of the preceding claims, further including a carry propagation sub-circuit coupled to the
cryptography processing unit.

9. A circuit according to any of the preceding claims, further including:

data path circuitry coupled to the normalization circuitry, the data path circuitry for performing cryptography
processing on the normalized data in a buffer.

10. A circuit according to claim 9, wherein the data is written on-the-fly into the buffer in normalized form. 
11. A circuit according to any of claims 9-10, wherein the buffer is a register file block.

12. A circuit according to any of claims 9-11, wherein data path circuitry comprises one or more carry save adders.

13. A circuit according to any of the preceding claims, wherein the shift amount corresponds to the number of zeros 
more significant than the leading one of the data. 

14. A circuit according to any of the preceding claims, further comprising 3DES-CBC encryption/decryption and 
MD5/SHA1 authentication/digital signature processing blocks. 

15. A circuit according to any of the preceding claims, wherein the processing comprises Diffie-Hellman/RSA/DSA 
public key processing. 

16. A method for performing normalization/denormalization in a cryptography accelerator unit coupled to an external 
processor, the method comprising: 

generating normalized data corresponding to unnormalized data; 
characterizing the normalized data in relation to the unnormalized data; and 

providing unnormalized data based upon the normalized data and the characterization, wherein the normali- 
zation/denormalization circuit efficiently provides a normalization/denormalization service to the cryptography 
accelerator unit such that substantially no external processor resources are used to normalize or denormalize 
data. 

17. A method according to claim 16, wherein the generating normalized data is performed on the fly by,

receiving the unnormalized data from the processor, and 

shifting the unnormalized data by a shift amount to form the normalized data. 

18. A method according to any of claims 16-17, wherein characterizing the normalized data includes tracking the 
shift amount. 

19. A method according to any of claims 16-18, wherein providing unnormalized data uses the shift amount to modify 
the normalized data to form the unnormalized data. 

20. A method according to any of claims 16-19, further comprising: 

performing cryptography processing on normalized data received from a register unit configured to store nor- 
malized data. 

21. A method according to claim 18, wherein the shift amount is equivalent to the number of zeros more
significant than the leading one of the corresponding unnormalized data.

22. A method according to any of claims 16-21, wherein cryptography processing is performed on normalized data 
using carry save computation. 

23. A method according to any of claims 16-22, wherein cryptography processing is performed using carry propagation.
Figure 6